Improve to_dask_dataframe performance #7844

Merged (5 commits) on May 25, 2023

@Illviljan Illviljan commented May 15, 2023

  • ds.chunks loops over all the variables on every access, so evaluate it once instead of once per variable.
  • It is faster to create a meta dataframe once than to let dask guess it 2000 times.
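
The first point can be illustrated with a small, self-contained sketch. It does not use xarray: `ToyDataset`, its `chunks` property, and the two helper functions are hypothetical stand-ins that only mimic the cost pattern described above (a property that walks every variable each time it is read). The second helper reads the property once before the loop.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset whose `chunks` property is expensive."""

    def __init__(self, variables):
        # variables: mapping of name -> mapping of dimension name -> chunk sizes
        self.variables = variables
        self.variables_visited = 0  # counts the work done inside `chunks`

    @property
    def chunks(self):
        # Walks every variable on every access.
        merged = {}
        for var_chunks in self.variables.values():
            self.variables_visited += 1
            merged.update(var_chunks)
        return merged


def collect_chunks_slow(ds):
    # Re-evaluates ds.chunks for each variable: n * n variable visits.
    return [ds.chunks for _ in ds.variables]


def collect_chunks_fast(ds):
    # Evaluates ds.chunks once and reuses the result: n variable visits.
    chunks = ds.chunks
    return [chunks for _ in ds.variables]


variables = {f"var{i}": {"x": (5, 5)} for i in range(100)}

slow_ds = ToyDataset(variables)
slow_result = collect_chunks_slow(slow_ds)

fast_ds = ToyDataset(variables)
fast_result = collect_chunks_fast(fast_ds)

print(slow_ds.variables_visited)  # 100 accesses * 100 variables
print(fast_ds.variables_visited)  # 1 access * 100 variables
```

Both helpers return the same values; only the number of variable visits differs, which is why hoisting the property access out of the loop pays off as the number of variables grows.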

@Illviljan Illviljan mentioned this pull request May 15, 2023
6 tasks
@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label May 16, 2023
@@ -6422,8 +6429,13 @@ def to_dask_dataframe(
if not is_duck_dask_array(var._data):
    var = var.chunk()

dask_array = var.set_dims(ordered_dims).chunk(self.chunks).data
series = dd.from_array(dask_array.reshape(-1), columns=[name])
if has_many_dims:
Is this really that impactful? Can we optimize set_dims instead?

I think I'll save the has_many_dims paths for a future PR. It might introduce bugs if we don't consistently chunk with the same shape.

Illviljan commented May 21, 2023

        before           after         ratio
     [05c7888d]       [d135ab97]
-      2.47±0.02s          806±6ms     0.33  pandas.ToDataFrameDask.time_to_dataframe
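
The ratio column is the "after" time divided by the "before" time. A quick check of that arithmetic, using only the two mean timings from the table above (the ± uncertainties are ignored, and the variable names are illustrative):

```python
before_s = 2.47   # mean time before the change, in seconds
after_s = 0.806   # mean time after the change (806 ms), in seconds

ratio = after_s / before_s    # what the "ratio" column reports
speedup = before_s / after_s  # the same result expressed as a speedup factor

print(f"ratio:   {ratio:.2f}")
print(f"speedup: {speedup:.1f}x")
```

A ratio of about 0.33 corresponds to the benchmark running roughly three times faster after the change.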

@Illviljan Illviljan added the plan to merge Final call for comments label May 24, 2023
@Illviljan Illviljan merged commit 609a901 into pydata:main May 25, 2023
@dcherian dcherian mentioned this pull request Jun 15, 2023
19 tasks
dstansby pushed a commit to dstansby/xarray that referenced this pull request Jun 28, 2023
* Improve to_dask_dataframe performance

* Add ASV test

* Update pandas.py

* Update dataset.py
Labels
plan to merge (Final call for comments), run-benchmark (Run the ASV benchmark workflow), topic-performance